PyFluent is the first Python development platform with built-in column-level lineage, STTM, AI-integrated coding, visual notebooks, automatic documentation, and an advanced execution framework — all in one.
Visual execution · lineage · AI assist — all in one platform
Data pipelines grow in complexity. Notebooks fragment into unmaintainable scripts. Column-level lineage is invisible. Documentation is always out of date.
No one knows which columns came from where. When upstream schemas change, downstream breaks are invisible until production fails.
Reading old notebooks, tracing pandas chains, deciphering variable names — engineers spend more time understanding code than writing it.
Compliance and audit teams can't trace Python code paths. Every regulatory review turns into a multi-week manual exercise.
Documentation is written once, never updated. New team members spend weeks reverse-engineering what a pipeline does and why.
PyFluent's visual notebook environment combines the familiarity of Jupyter with AI assistance, live lineage visualization, and inline STTM generation — all without leaving your flow.
Every cell has an AI co-pilot that understands your full notebook context. Generate transformations, explain complex logic, refactor chains, or detect bugs — inline, without switching tools.
As you write, a live lineage graph updates in real time alongside your notebook. See which datasets feed which outputs and trace every column's origin without running the code first.
Inline table views, schema cards, and distribution charts render directly beneath each cell output. Explore your data visually without writing separate profiling code.
Every cell edit is tracked with diffs. Compare execution results across versions, roll back individual cells, and annotate changes — full Git-level traceability at cell granularity.
PyFluent analyzes your cell dependency graph and warns when execution order will produce incorrect results. Automatically suggests the correct sequence before you hit Run All.
Export to production Python modules, FastAPI endpoints, Airflow DAGs, or Spark jobs directly from the notebook. PyFluent strips notebook scaffolding and produces clean, typed output.
PyFluent covers the complete lifecycle — analyze, convert, execute, validate, and accelerate with AI — in one integrated platform.
Scan SAS, DataStage, Informatica, Teradata BTEQ, PL/1, and JCL to auto-build a complete inventory. Discover dependencies, macro chains, external calls, data sources, and fan-in/fan-out hot spots. Produce visual lineage and impact maps that guide the entire modernization.
Parser-driven conversion to Python, PySpark, Snowpark, and SQL for Snowflake, Databricks, BigQuery, Redshift, and Fabric. All translations are explainable and auditable — no black boxes.
Run converted workloads in the correct order with a driver notebook or job runner. Standardize on Delta and cloud storage, schedule, monitor, and auto-retry — with centralized logs and metrics.
Partitioned validation compares row-level and aggregate outputs between legacy and modern systems. Automatic schema checks, data matching reports, and exception trails give confidence to go live.
Context-aware AI assistance that knows your inventory, lineage, and conversion plans. Generate unit tests, explain diffs, suggest mappings, and draft notebooks with your rules applied — securely inside your environment.
PyFluent instruments your Python pipelines at parse time to extract source-to-target mappings, transformation logic, and dependency graphs — no annotations, no decorators required.
| Source Column | Source Dataset | Transformation | Target Column | Target Dataset | Type |
|---|---|---|---|---|---|
| revenue | sales.parquet | SUM(revenue) | total_revenue | report_final | AGG |
| order_id | sales.parquet | NUNIQUE(order_id) | order_count | report_final | AGG |
| region | sales.parquet | FILTER(region=APAC) | region | report_final | DIRECT |
| product_sku | sales.parquet | JOIN key | product_name | report_final | LOOKUP |
| list_price, units | products.csv, sales.parquet | list_price * units | gross_value | report_final | COMPUTED |
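For illustration, a mapping like the one above can be traced from plain pandas code along these lines. The file and column names mirror the table; the snippet is hand-written to show the kind of source PyFluent reads, not its output.

```python
import pandas as pd

# Ordinary pandas, no decorators or annotations; names match the STTM rows above.
sales = pd.read_parquet("sales.parquet")
products = pd.read_csv("products.csv")

apac = sales[sales["region"] == "APAC"]                         # FILTER(region=APAC)
joined = apac.merge(products, on="product_sku", how="left")     # JOIN key: product_sku
joined["gross_value"] = joined["list_price"] * joined["units"]  # COMPUTED

report_final = (
    joined.groupby(["region", "product_name"], as_index=False)
    .agg(
        total_revenue=("revenue", "sum"),      # AGG: SUM(revenue)
        order_count=("order_id", "nunique"),   # AGG: NUNIQUE(order_id)
        gross_value=("gross_value", "sum"),
    )
)
report_final.to_parquet("report_final.parquet")
```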
PyFluent parses your Python AST at import time. No decorators, no schema files, no manual mapping — lineage is captured from plain Python code automatically.
Before changing any column, instantly see every downstream function, DataFrame, export, and report it affects. Color-coded risk scores surface breaking changes pre-commit.
Lineage spans across modules, scripts, and notebooks. Import chains, function calls, and dataset handoffs are mapped into a single project-wide dependency graph.
PyFluent generates rich, human-readable documentation from your actual code and lineage metadata. No templates. No manual effort. Always accurate.
```python
def calc_summary(df):
    return (
        df
        .groupby("region")
        .agg({
            "revenue": "sum",
            "order_id": "nunique"
        })
    )
```
"""
Aggregates sales data by region.
Args:
df: Sales DataFrame with columns
revenue (float), order_id (str),
region (str)
Returns:
DataFrame — regional summary:
- total_revenue: SUM(revenue)
- order_count: NUNIQUE(order_id)
Lineage:
revenue → total_revenue [AGG]
order_id → order_count [AGG]
"""
## calc_summary output
| Column | Type | Source |
|---------------|---------|-------------|
| region | str | pass-through|
| total_revenue | float64 | SUM(revenue)|
| order_count | int64 | NUNIQUE(order_id) |
Quality rules:
- total_revenue >= 0
- order_count > 0
- region IN known_regions
PyFluent's execution framework understands your pipeline's dependency graph and runs it optimally — parallel where possible, sequential where required — across local, Spark, or cloud environments.
PyFluent builds a DAG from your pipeline's data dependencies and executes independent branches in parallel automatically. No Airflow config files. No manual wiring.
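The scheduling pattern itself is standard: topologically sort the dependency graph and dispatch every step whose inputs are ready. A minimal sketch using only the standard library, illustrating the idea rather than PyFluent's scheduler API:

```python
from concurrent.futures import ThreadPoolExecutor
from graphlib import TopologicalSorter

# Hypothetical pipeline: step name -> (callable, upstream steps it depends on).
steps = {
    "load_sales":    (lambda: "sales",    set()),
    "load_products": (lambda: "products", set()),
    "join":          (lambda: "joined",   {"load_sales", "load_products"}),
    "report":        (lambda: "report",   {"join"}),
}

ts = TopologicalSorter({name: deps for name, (_, deps) in steps.items()})
ts.prepare()

with ThreadPoolExecutor() as pool:
    while ts.is_active():
        ready = list(ts.get_ready())                               # independent branches
        futures = {name: pool.submit(steps[name][0]) for name in ready}
        for name, future in futures.items():
            future.result()                                        # surface failures early
            ts.done(name)
```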
Run the same pipeline locally for development, on Spark for scale, or serverless on AWS/GCP/Azure. Target-specific optimizations applied automatically per environment.
Smart checkpointing resumes from the last successful step. Incremental mode processes only new or changed partitions — cutting runtime by up to 80% on large datasets.
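One common way to implement the incremental half of that is to keep a small state file of processed partitions and skip anything already recorded. A generic sketch, not PyFluent's checkpoint format; paths are hypothetical:

```python
import json
import pathlib

import pandas as pd

DATA_DIR = pathlib.Path("data/sales")               # hypothetical date-partitioned dataset
STATE = pathlib.Path("checkpoints/processed.json")  # hypothetical state file

done = set(json.loads(STATE.read_text())) if STATE.exists() else set()
pending = [p for p in sorted(DATA_DIR.glob("date=*")) if p.name not in done]

for partition in pending:                            # only new or changed partitions
    df = pd.read_parquet(partition)
    # ... transform and write the output for this partition ...
    done.add(partition.name)

STATE.parent.mkdir(parents=True, exist_ok=True)
STATE.write_text(json.dumps(sorted(done)))
```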
PyFluent generates data contract tests from your STTM automatically. Run regression suites that validate column-level transformations against expected outputs without writing a single assert.
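The generated tests boil down to column-level checks like the ones below, sketched here by hand from the STTM rows and quality rules shown earlier (the tool's literal output is not reproduced in this document):

```python
import pandas as pd

def test_report_final_contract():
    """Contract checks derived from the STTM: schema, aggregates, quality rules."""
    sales = pd.read_parquet("sales.parquet")
    report = pd.read_parquet("report_final.parquet")

    # Schema contract: the mapped target columns must exist.
    assert {"region", "total_revenue", "order_count"} <= set(report.columns)

    # Transformation contract: SUM(revenue) on the APAC slice must be preserved.
    expected = sales.loc[sales["region"] == "APAC", "revenue"].sum()
    assert abs(report["total_revenue"].sum() - expected) < 1e-6

    # Quality rules from the generated documentation.
    assert (report["total_revenue"] >= 0).all()
    assert (report["order_count"] > 0).all()
```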
Every run produces a flame graph of cell and function execution time, memory peaks, and shuffle costs. Identify bottlenecks in seconds without adding profiling instrumentation.
Schedule notebooks and pipelines directly from PyFluent — cron, event-driven, or API-triggered. No separate orchestration platform required for most enterprise workloads.
Visual execution runs directly on Snowflake and Databricks — combining lineage and live code in one workspace, with a direct warehouse session and step-by-step visibility into any failure point.
AutoBot is a production-grade platform for hierarchical PySpark notebook execution with real-time monitoring, dependency safety, performance analytics, and enterprise operations built in.
Organize master and child notebooks with dependency management. Define execution order, handle failures gracefully, and run complex multi-stage PySpark pipelines from a single entry point.
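Stripped to its core, the master/child pattern on Databricks looks like the sketch below, using the standard dbutils.notebook.run call; AutoBot layers dependency management, retries, and monitoring on top of this idea. Notebook paths and arguments are hypothetical.

```python
# Master notebook: run child notebooks in order, fail fast on the first error.
# dbutils is available inside a Databricks notebook session.
children = [
    ("/pipelines/ingest_sales", {}),
    ("/pipelines/transform",    {"mode": "daily"}),
    ("/pipelines/publish",      {}),
]

for path, args in children:
    try:
        result = dbutils.notebook.run(path, timeout_seconds=3600, arguments=args)
        print(f"{path} finished: {result}")
    except Exception as exc:
        print(f"{path} failed: {exc}")
        raise  # stop here so downstream notebooks never run on bad inputs
```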
Live execution updates streamed directly to your dashboard via WebSockets. See each notebook's progress, current stage, and failure point the moment it happens — no polling, no refresh.
Capture per-notebook runtime, memory, shuffle, and cost metrics. AutoBot's anomaly detection surfaces regressions and unexpected slowdowns before they impact downstream pipelines.
Track compute cost per notebook run, per pipeline, and per team. Identify the most expensive workloads and right-size your cluster configuration with data-driven recommendations.
Configurable email alerts for failures, completions, and anomalies. Every run is logged with auth context and audit-friendly operation metadata — ready for compliance review.
Native Databricks integration with support for Docker and Kubernetes. Deploy AutoBot in your existing cloud infrastructure with minimal configuration and no vendor lock-in.
Integrates with your existing Databricks workspace, Docker containers, or Kubernetes clusters — capturing execution metrics and anomaly signals from your very first run.
AST-based framework migration, Python and PySpark version upgrades, deep code analysis, and rich HTML reports — all from one toolkit.
Large API surface: DataFrame ops, group-by, joins, strings, datetimes, I/O, windows, resampling. Import rewrites and idiomatic Polars expressions with PEP 8-oriented output and comment preservation.
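A representative before/after for the group-by case, written by hand to show the target idiom (the converter's exact formatting may differ):

```python
import pandas as pd
import polars as pl

# Before: pandas
def summarize_pd(df: pd.DataFrame) -> pd.DataFrame:
    return df.groupby("region", as_index=False).agg(total=("revenue", "sum"))

# After: idiomatic Polars expressions
def summarize_pl(df: pl.DataFrame) -> pl.DataFrame:
    return df.group_by("region").agg(pl.col("revenue").sum().alias("total"))
```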
Shrink operational complexity when local Polars fits your workload. Session and DataFrame patterns, SQL-style functions, joins, windows, aggregations, I/O — UDFs flagged for manual review.
Scale out pandas-style code to distributed Spark. GroupBy, windows, merges, pivots, and filtering — with SparkSession setup and import management handled automatically.
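Again as a hand-written illustration of the shape of the output, with the SparkSession boilerplate the converter adds alongside the rewrite:

```python
import pandas as pd
from pyspark.sql import SparkSession, functions as F

# Before: pandas
def apac_revenue_pd(df: pd.DataFrame) -> pd.DataFrame:
    return df[df["units"] > 0].groupby("region", as_index=False)["revenue"].sum()

# After: PySpark, with session setup and imports handled for you
spark = SparkSession.builder.appName("apac_revenue").getOrCreate()

def apac_revenue_spark(df):
    return (
        df.filter(F.col("units") > 0)
          .groupBy("region")
          .agg(F.sum("revenue").alias("revenue"))
    )
```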
Bring Spark logic back to notebooks and local tests. Reverse mapping for common DataFrame operations. Windows and joins translated toward pandas idioms for fast iteration.
AST-driven upgrades with cumulative rules across versions. Syntax, typing, stdlib deprecations, and library notes — each run writes upgraded Python plus a companion HTML report summarizing what changed, what to watch, and what needs manual review.
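The flavor of change such a run makes, shown as a representative 3.9-to-3.12 typing modernization rather than the tool's literal output:

```python
# Before: 3.9-era typing imports
from typing import Dict, List, Optional

def bucket(names: List[str]) -> Dict[str, Optional[int]]:
    return {name: None for name in names}

# After: 3.12 built-in generics and | unions, no typing imports required
def bucket(names: list[str]) -> dict[str, int | None]:
    return {name: None for name in names}
```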
Parallel pipeline with its own rules and HTML summary. API modernization across major versions — SQLContext→SparkSession patterns, deprecated or breaking patterns surfaced for review, same reporting style as Python upgrades.
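For example, the SQLContext-to-SparkSession modernization it surfaces looks like this (hand-written before/after):

```python
# Before: PySpark 2.x entry point
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext()
sqlContext = SQLContext(sc)
df = sqlContext.read.parquet("sales.parquet")

# After: PySpark 3.x, SparkSession as the single entry point
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet("sales.parquet")
```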
Python's AST preserves structure and semantics — mappings stay maintainable. No fragile regex-only rewrites. Default formatting via isort/autopep8. Comments, import sorting, and structure preserved automatically.
```bash
# Python version upgrader
python converters/python_version_upgrader.py your_script.py --from 3.9 --to 3.12 -o out/upgraded.py

# PySpark version upgrader
python converters/pyspark_version_upgrader.py spark_job.py --from 2.4 --to 3.0 -o out/spark_job.py
```
Writes out/upgraded.html next to the upgraded file. Omit -o to use a default _py312-style suffix next to the input.
PyFlow ships two distinct HTML report types: interactive analysis of visitor logs, and migration summaries for upgrade/converter runs — both first-class outputs.
Run src/run_analyzer.py on an ANTLR visitor log. Parses the log, runs PyFlowAnalyzer, optionally renders Graphviz graphs, and builds a full HTMLReportGenerator page.
```bash
python src/run_analyzer.py your_file.log
python src/run_analyzer.py your_file.log --no-graphs
python src/run_analyzer.py your_file.log --offline
```
Typical outputs: {basename}_analysis.html, {basename}_analysis.json, {basename}_enhanced.py, {basename}_regenerated.py, and graph images (flow, calls, dependencies) as PNG and SVG.
The analysis pipeline is parser → analyzer → report + visualization (Graphviz). The version upgraders and framework converters emit a separate HTML file beside the transformed .py. These pages focus on migration accountability: what was upgraded, which imports moved, deprecations, and explicit manual-review items — not program graphs.
The converters/ pipelines (pandas ↔ polars, pyspark ↔ polars) emit HTML alongside the new source; examples ship under testing/examples_python/, testing/examples_pyspark/, and testing/demo_*.html. See a real-world PyFlow Parser output — full code analysis with charts, metrics, call graphs, dependency maps, enhanced code views, and visitor-log-derived summaries generated from an actual codebase.
Command-line tools for batch runs. Programmatic hooks for CI pipelines and custom workflows. Integrate PyFlow Parser directly into your development and migration automation.
Data scientists, engineers, platform owners, and modernization teams. Faster Polars or Spark adoption, less manual rewrite, and clearer reports on every change — from version upgrades to framework migrations.
Six purpose-built modules, each with its own UI and analytics — all sharing a single lineage and metadata graph.

Assess thousands of scripts instantly — map complexity, dependencies, and readiness. Get a prioritized plan, safer cutovers, and faster production go-lives.

Visualize code across jobs, tables, and SQL — sources, flows, and column-level changes. Speeds impact checks, lowers migration risk, and supports compliance audits.

Convert legacy SAS, DataStage, BTEQ, and more into Python, PySpark, Snowpark, or SQL with matched outputs. Modernize faster, keep logic intact.

Automatically map legacy schemas to Snowflake or Databricks with clear, auditable mappings. Enforce naming, data types, and get full audit-ready visibility.

Automatic documentation captures legacy and target code — detailing components, parameters, and dependencies for clear traceability and compliance reporting.

Compare source and target outputs at scale using configurable keys and rules. Flag mismatches, duplicates, and gaps with actionable reports for fast resolution.
From first keystroke to production deployment, PyFluent covers the full development lifecycle.
On-premises deployment, full column-level audit trails, and auto-generated compliance reports ready for your next examination.
No other Python platform combines AI coding assistance, zero-annotation lineage capture, STTM, and automatic documentation in a single on-premises product.
Other tools require decorators, schema registries, or manual lineage annotations. PyFluent captures column-level STTM by parsing your Python AST at import — no code changes, no instrumentation agents.
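The underlying idea is ordinary Python tooling: walk the AST and record which columns feed which assignments. A deliberately tiny sketch of that idea (not PyFluent's parser, which also resolves joins, aggregations, and cross-module flows):

```python
import ast

SOURCE = 'df["gross_value"] = df["list_price"] * df["units"]'

class ColumnLineage(ast.NodeVisitor):
    """Record (source columns, target column) pairs from column assignments."""
    def __init__(self):
        self.mappings = []

    def visit_Assign(self, node):
        target = node.targets[0]
        if isinstance(target, ast.Subscript) and isinstance(target.slice, ast.Constant):
            sources = [
                n.slice.value
                for n in ast.walk(node.value)
                if isinstance(n, ast.Subscript) and isinstance(n.slice, ast.Constant)
            ]
            self.mappings.append((sources, target.slice.value))
        self.generic_visit(node)

visitor = ColumnLineage()
visitor.visit(ast.parse(SOURCE))
print(visitor.mappings)  # [(['list_price', 'units'], 'gross_value')]
```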
AI-generated docs are written once and forgotten. PyFluent re-generates documentation every time code changes, using live lineage metadata — so your data dictionary is always accurate, always versioned.
Deploy behind your firewall. No SaaS dependency. No telemetry. Your source code, lineage graphs, and documentation stay in your network — always. One Docker image, up in minutes.
Unlike generic copilots, PyFluent's AI knows your full pipeline lineage when making suggestions. It won't suggest a transformation that would break downstream dependencies — because it can see them.
Drop PyFluent into your existing Python environment. Your first lineage graph renders before your first coffee refill.
Install the PyFluent server, then open your first notebook in PyFluent Studio or the VS Code extension. Point it at your existing data sources — S3, Databricks, Snowflake, or local files. Lineage capture starts immediately.
Review your auto-generated lineage graphs and STTM tables. Generate your first data dictionary and pipeline documentation. Share with your governance team — they'll ask what changed.
Enable data quality rules, configure compliance exports, set up impact analysis alerts. Onboard your full data team — every notebook they open immediately gains lineage and AI assistance.